

**Doctoral Thesis Defense** 

Ph.D. Program in Computer and Control Engineering (32nd cycle)

# **Optimization Tools for ConvNets on the Edge**

#### **Valentino Peluso**

#### **Supervisors**

Prof. Enrico Macii, Supervisor

Prof. Andrea Calimera, Co-supervisor

# **Sensing and Sensemaking**



Infrared Cameras

#### IoT: Good in sensing, Poor in sensemaking

#### The value of AI



[Source] *McKinsey* 

# **Edge-AI for the IoT**

#### Sense-making:

- Present: in-the-cloud
- Future: at-the-edge



#### Edge Computing

- ✓ Reduce response time
- ✓ Save transmission energy
- ✓ Improve privacy & security



# Making sense of data

- Convolutional Neural Networks (ConvNets) achieved human-level accuracy
  - End-to-end learning, i.e. automatic feature selection
- Designing ConvNets:
  - Training: learn a proper set of parameters (*W*, *b*) using Back-Propagation
  - Inference: Feed-forward execution of the net
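As a minimal sketch of what "feed-forward execution" means for one layer, the snippet below applies a single-channel convolution, a bias, and a ReLU. The function name `conv2d_relu`, the 3x3 input, and the 2x2 filter are illustrative assumptions, not taken from the thesis.

```python
def conv2d_relu(x, w, b):
    """Feed-forward pass of one single-channel conv layer.

    x: 2D input (list of rows), w: square K x K filter, b: scalar bias.
    Computes a valid cross-correlation, adds the bias, applies ReLU.
    """
    k = len(w)
    out_h, out_w = len(x) - k + 1, len(x[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = b
            for di in range(k):
                for dj in range(k):
                    acc += x[i + di][j + dj] * w[di][dj]
            row.append(max(acc, 0.0))  # ReLU
        out.append(row)
    return out


# Illustrative input and parameters (W, b); in practice W and b come from training.
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
w = [[-1.0, 0.0],
     [0.0, 1.0]]
y = conv2d_relu(x, w, b=0.5)
```

Inference only needs this forward pass; training additionally propagates gradients backwards to update `w` and `b`.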



# **Applications and Hardware**

- Applications: Activity recognition, Image classification, Object Detection, Anomaly detection, Face recognition, Segmentation, **Keyword Spotting**, Style Transfer, Autonomous navigation
- Hardware: **Microcontrollers (MCUs)**, ASICs/DSPs, **Embedded CPUs**
- Power: 10-100mW, ~3.5W, ~10W
- Memory: <1MB, ~2GB, ~4GB
- ✓ Large diffusion, Power/Performance, Low Cost, Stability, Stable toolchains, Low Energy
- × High Cost, Low Thermal Design Power, Low Memory, Unstable toolchains, Low Performance

#### **ConvNets are huge!**



# **Existing tools for Neural Network Optimization**

- 1) Topology Optimization
  - Manual or Automatic (NAS)
- 2) Pruning
  - Filter Pruning
  - Weight Pruning
- 3) Quantization
  - Floating-Point → Fixed-Point
  - Bit-width (1-, 2-, 3-, 4-, 8-bit)

$\rightarrow$ Joint application to maximize savings
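The joint application of weight pruning and fixed-point quantization can be sketched as follows. This is a simplified illustration under stated assumptions: magnitude-based pruning and a per-tensor power-of-two scale are common choices, but the helper names (`prune_by_magnitude`, `quantize_fixed_point`) and the sample weights are hypothetical and not the tools presented in the thesis.

```python
import math


def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_fixed_point(weights, bits):
    """Convert floats to signed `bits`-bit fixed-point codes.

    Returns (codes, scale) with scale = 2**-frac_bits, so that
    weight ~= code * scale. Values outside the range saturate.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # Integer bits needed to cover the largest magnitude; the rest are fractional.
    int_bits = max(0, math.ceil(math.log2(max_abs))) if max_abs > 0 else 0
    frac_bits = bits - 1 - int_bits
    scale = 2.0 ** -frac_bits
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.12]
pruned = prune_by_magnitude(weights, sparsity=0.5)
codes, scale = quantize_fixed_point(pruned, bits=8)
restored = [c * scale for c in codes]
```

Pruning first and quantizing afterwards means the zeros survive quantization exactly, so the two savings compound.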





# Challenges

#### **Multi-objective optimization**





#### Hardware diversity

[Figure: edge architecture share (%) by CPU, ASIC, FPGA, and Other, 2017 vs. 2025; edge hardware total market (\$billion): <0.1 in 2017, 4-4.5 in 2025] [Source] *McKinsey*

[Timeline: 2012 to 2017, ~5 years]

AlexNet: 1<sup>st</sup> place on ImageNet (2012)



# Modular collection of optimization tools



# **1. MEMORY OPTIMIZATION**

# Challenges

- **Goal:** Edge inference on ultra low-power MCUs.
- **Challenge:** Extreme memory constraints
  - ConvNet Parameters (Flash and RAM)
    - 500K to 100M parameters
  - ConvNet intermediate results (RAM)
- Limitation:
  - × Limited ISA: minimum bit-width is 8-bit




# Prune and Quantize (PaQ)

Motivation: Identify the best combination of pruning and quantization for memory-constrained applications.




[Source] *CLIP-Q: Deep network compression learning by in-parallel pruning-quantization*, F. Tung et al., CVPR18

### **Prune and Quantize: Results**

- Parametric design-space exploration
  - Bit-width: 16- down to 2-bit
    - 8-bit tested on-device
    - Other bit-widths via emulation
  - Memory (Mem.)

**Image Classification on CIFAR-10**

| Mem. | Optimal Bit-width | Optimal Top-1 | ARM Bit-width | ARM Loss |
|------|-------------------|---------------|---------------|----------|
| 245  | 15                | 83.10         | 8             | 0.25     |
| 115  | 7                 | 82.64         | 8             | 0.20     |
| 98   | 7                 | 81.99         | 8             | 0.59     |
| 82   | 6                 | 81.49         | 8             | 0.70     |
| 66   | 6                 | 80.42         | 8             | 1.57     |
| 49   | 5                 | 78.17         | 8             | 6.53     |
| 33   | 5                 | 71.85         | 8             | 17.17    |

- **3x compression** with <1% accuracy loss
- For most solutions 8-bit has marginal loss
- We need custom HW at extreme constraints
- Optimal bit-widths: not supported by MCUs
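A parametric exploration of this kind can be sketched as an enumeration of pruning and bit-width combinations that fit a memory budget. The storage model below (surviving weights stored densely, no index overhead) and all numeric values are assumptions chosen for illustration; they are not the PaQ cost model.

```python
def footprint_bytes(n_params, keep_ratio, bits):
    """Bytes needed for the weights that survive pruning.

    Assumes dense storage of the kept weights with no index overhead.
    """
    kept = int(n_params * keep_ratio)
    return (kept * bits + 7) // 8  # round up to whole bytes


def feasible_configs(n_params, budget_bytes, keep_ratios, bit_widths):
    """All (keep_ratio, bits) pairs whose footprint fits the budget."""
    return [(r, b)
            for r in keep_ratios
            for b in bit_widths
            if footprint_bytes(n_params, r, b) <= budget_bytes]


# Hypothetical model with 100K parameters and a 64 KB budget.
configs = feasible_configs(
    n_params=100_000,
    budget_bytes=64 * 1024,
    keep_ratios=[1.0, 0.75, 0.5, 0.25],
    bit_widths=[16, 8, 4, 2],
)
```

In this toy setting the unpruned model only fits at 4 bits or fewer, while pruning half of the weights makes the MCU-friendly 8-bit format feasible. Each feasible pair would then be ranked by its measured accuracy.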

## **Encoding-Aware Sparse Training**

Goal: Reduce size of ConvNet Parameters

#### SoA: Sparse Training + Weight Encoding



## **Encoding-Aware Pruning**



## **Sparse Training**



| Hyper-parameters  | Notation | Initial value |
|-------------------|----------|---------------|
| Target Memory     | $M_t$    | 12-112 KB     |
| Pruning Frequency | $N$      | 1             |
| Target Sparsity   | $S_t$    | 30%           |
| Group Size        | $GS$     | 1             |

[Source] *To prune, or not to prune: exploring the efficacy of pruning for model compression*, M. Zhu et al., arXiv 2017
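For reference, the gradual pruning schedule proposed in the cited work by Zhu et al. raises sparsity from an initial to a final value along a cubic curve. The sketch below implements that published schedule only; how it maps onto the hyper-parameters in the table above is not stated here, and the example values (final sparsity 30%, 10 pruning steps) are assumptions.

```python
def sparsity_at_step(t, s_initial, s_final, t_start, n_steps, delta_t):
    """Cubic sparsity schedule of Zhu et al. (2017).

    s(t) = s_final + (s_initial - s_final) * (1 - (t - t_start) / (n_steps * delta_t))**3
    Sparsity is held at s_initial before t_start and at s_final
    once all n_steps pruning steps (spaced delta_t apart) are done.
    """
    if t <= t_start:
        return s_initial
    if t >= t_start + n_steps * delta_t:
        return s_final
    progress = (t - t_start) / (n_steps * delta_t)
    return s_final + (s_initial - s_final) * (1.0 - progress) ** 3


# Hypothetical run: prune every step, 10 steps, from 0% to 30% sparsity.
schedule = [sparsity_at_step(t, 0.0, 0.30, t_start=0, n_steps=10, delta_t=1)
            for t in range(11)]
```

The curve prunes aggressively early, when many weights are redundant, and slows down as the network approaches the final sparsity.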



#### Lower Sparsity = Higher Accuracy?

### **Yes! Lower sparsity = Higher accuracy**

$M_t$: Target Memory · CR: Compression Ratio · $S_x$: Sparsity · $A_x$: Accuracy · $\Delta A$: Accuracy difference

#### **ResNet-9 on CIFAR10**

| $M_t$ (KB) | CR            | $S_{\mathrm{WP}}$ | $S_{\mathrm{EAST}}$ | $A_{\mathrm{WP}}$ | $A_{\mathrm{EAST}}$ | $\Delta A$ |
|------------|---------------|-------------------|---------------------|-------------------|---------------------|------------|
| 112        | $5.0 \times$  | 58.5%             | 49.5%               | 89.80%            | 89.46%              | -0.34%     |
| 80         | $7.0 \times$  | 76.0%             | 60.5%               | 88.67%            | 88.61%              | -0.06%     |
| 48         | $11.6 \times$ | 89.5%             | 74.8%               | 87.51%            | 87.44%              | -0.07%     |
| 40         | $14.0 \times$ | 92.0%             | 79.0%               | 86.80%            | 86.82%              | 0.02%      |
| 32         | $17.4 \times$ | 94.0%             | 83.3%               | 85.30%            | 86.11%              | 0.81%      |
| 24         | $23.3 \times$ | 96.0%             | 87.8%               | 82.33%            | 83.65%              | 1.32%      |
| 20         | $27.9 \times$ | 96.8%             | 90.0%               | 79.63%            | 81.11%              | 1.48%      |
| 16         | $34.9 \times$ | 97.5%             | 91.8%               | 74.16%            | 78.45%              | **4.29%**  |
| 12         | $46.5 \times$ | 98.3%             | 94.0%               | 55.59%            | 64.32%              | **8.73%**  |

- Similar accuracy for larger memory
- Better accuracy for tighter memory
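The table can be cross-checked with a few lines of arithmetic. The snippet recomputes $\Delta A = A_{\mathrm{EAST}} - A_{\mathrm{WP}}$ for every row and, assuming CR is the ratio between the uncompressed model size and $M_t$ (an interpretation, not stated on the slide), derives the implied uncompressed size.

```python
# (M_t in KB, CR, A_WP in %, A_EAST in %, reported delta A in %)
rows = [
    (112, 5.0, 89.80, 89.46, -0.34),
    (80, 7.0, 88.67, 88.61, -0.06),
    (48, 11.6, 87.51, 87.44, -0.07),
    (40, 14.0, 86.80, 86.82, 0.02),
    (32, 17.4, 85.30, 86.11, 0.81),
    (24, 23.3, 82.33, 83.65, 1.32),
    (20, 27.9, 79.63, 81.11, 1.48),
    (16, 34.9, 74.16, 78.45, 4.29),
    (12, 46.5, 55.59, 64.32, 8.73),
]

# Accuracy difference recomputed from the two accuracy columns.
delta_a = [round(a_east - a_wp, 2) for _, _, a_wp, a_east, _ in rows]

# Implied uncompressed size, assuming CR = uncompressed size / M_t.
implied_dense_kb = [round(m_t * cr, 1) for m_t, cr, _, _, _ in rows]
```

Every recomputed difference matches the reported column, and the implied uncompressed size stays within a few KB of 560 KB across all rows, which is consistent with a single baseline model.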

# **2. ENERGY OPTIMIZATION**

# **Adaptive ConvNets**

- Motivation: SoA ConvNets are designed and deployed as static graphs
- Goal: Adaptive ConvNets
- Contributions:
  - 1) Online Precision Scaling
  - 2) Scalable-Effort ConvNets



# **Enable Effort-Accuracy Scaling**

- Improve/Reduce accuracy → Increase/Reduce effort, hence energy
  - Knob: dynamic precision scaling
  - Granularity: per-layer
  - Key Feature: single weight-set



# **Per-layer precision assignment**

#### Why per-layer?

- Define multiple operating points
- Fine-grain control on effort-accuracy trade-off
- **Objective:** Identify Pareto optimal configurations in the energy-accuracy space
- **2** precision options:
  - full (16-bit)
  - half (8-bit)

[Figure: layers labeled full, half, half, full] Which Precision?

**ImageNet Classification**

| ConvNet      | FP32 Acc. | #Params | #Cycles | #Layers |
|--------------|-----------|---------|---------|---------|
| ResNet18     | 69.13%    | 11.68M  | 29.35M  | 21      |
| ResNet34     | 72.69%    | 21.78M  | 57.28M  | 37      |
| ResNet50     | 74.10%    | 25.50M  | 74.40M  | 54      |
| SqueezeNet   | 56.36%    | 1.23M   | 6.45M   | 32      |
| MobileNet v2 | 69.98%    | 3.47M   | 12.13M  | 54      |

- ResNet18: $2^{21} = 2.1 \times 10^{6}$ configurations
- MobileNet v2: $2^{54} = 1.8 \times 10^{16}$ configurations
- We need heuristics!
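The size of the search space and the notion of Pareto optimality can be made concrete with a short sketch. The layer counts come from the table; the candidate (energy, accuracy) points are hypothetical values used only to exercise the filter.

```python
def search_space_size(n_layers, n_options=2):
    """Number of per-layer precision assignments."""
    return n_options ** n_layers


def pareto_front(points):
    """Keep the (energy, accuracy) points not dominated by any other point.

    A point dominates another if it has lower-or-equal energy and
    higher-or-equal accuracy, and is strictly better in at least one.
    """
    front = []
    for e, a in points:
        dominated = any(
            e2 <= e and a2 >= a and (e2 < e or a2 > a)
            for e2, a2 in points
        )
        if not dominated:
            front.append((e, a))
    return sorted(front)


layers = {"ResNet18": 21, "ResNet34": 37, "ResNet50": 54,
          "SqueezeNet": 32, "MobileNet v2": 54}
sizes = {name: search_space_size(n) for name, n in layers.items()}

# Hypothetical (normalized energy, top-1 accuracy in %) candidates.
candidates = [(1.00, 69.1), (0.85, 68.7), (0.90, 68.0), (0.75, 66.6), (0.80, 66.0)]
front = pareto_front(candidates)
```

With 54 layers an exhaustive evaluation is out of reach, so only a heuristic that visits a small subset of assignments can be run in practice.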

#### **Online Precision Scaling: Design**



### **Online Precision Scaling: Results**



| Benchmark       | #Points | $\Delta$ Top-1 | Savings | Ex. Time     |
|-----------------|---------|----------------|---------|--------------|
| ResNet18 v1     | 6       | 2.5%           | 27.4%   | 8 min 32 s   |
| ResNet34 v1     | 6       | 3.3%           | 29.8%   | 12 min 36 s  |
| ResNet50 v1     | 6       | 2.0%           | 25.6%   | 25 min 17 s  |
| SqueezeNet v1.1 | 5       | 1.3%           | 28.4%   | 6 min 19 s   |
| MobileNet v2    | 5       | 8.0%           | 19.2%   | 14 min 33 s  |

- Depthwise Convolutions need high-precision

#### **Beyond Energy-Accuracy Scaling: Brain Teaching**



# **Static vs Dynamic**

- SoA: Hierarchical ConvNets
  - Tune the computational effort depending on the complexity of the input
    - E.g. drop some filter/layer at run-time



features clearly visible



features are "masked"



# **Training Data-Sets are Hierarchical**

- Common datasets reflect the semantic abstraction of human reasoning
  - E.g. ImageNet: 1000 classes, 16 levels of abstraction



# Multilevel classification with ConvNets

- E.g. Image Classification on CIFAR-10



P(animal) = P(bird) + P(cat) + P(dog) + P(deer) + P(horse) + P(frog)
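The aggregation above can be sketched directly on the softmax output of a CIFAR-10 classifier. The class list is the standard CIFAR-10 labeling; the "vehicle" super-class for the remaining four classes and the sample logits are assumptions added for illustration.

```python
import math

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
ANIMALS = {"bird", "cat", "deer", "dog", "frog", "horse"}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_probabilities(probs):
    """Sum fine-grained class probabilities into super-class probabilities."""
    p_animal = sum(p for c, p in zip(CIFAR10_CLASSES, probs) if c in ANIMALS)
    return {"animal": p_animal, "vehicle": 1.0 - p_animal}


# Hypothetical logits where bird, cat and dog compete with each other.
logits = [0.1, 0.2, 2.0, 1.5, 0.3, 1.8, 0.2, 0.4, 0.1, 0.0]
probs = softmax(logits)
coarse = coarse_probabilities(probs)
```

Even when no single fine-grained class is confident, the super-class probability can be high, which is why the coarser answer comes at no extra inference cost.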

### Scalable Effort ConvNets: Results (1)

#### Multi-level Classification on ImageNet





# Scalable Effort ConvNets: Results (2)

#### Adaptive ConvNets

- Multilevel Classification  $\rightarrow$  increase accuracy with same effort
- Per-layer Precision Scaling → define multiple points in the energy-accuracy space



# **3. POWER OPTIMIZATION**

### Motivations

#### **1.** Temperature

Embedded SoCs have limited TDP

 $\rightarrow$  High temperature when running intensive workloads (e.g. inference)

 $\rightarrow$  Peak-performance for short run-time windows.

#### **2.** Energy

Energy reduction via power minimization

**Neglected by SoA NN optimization**

Goal:

# **Thermal-Aware DVFS**

- **Problem:** Data Analytics on a stream-of-data  $\rightarrow$  Continuous Inference
- Challenge: Mobile SoCs have limited TDP



- Thermal-Aware DVFS: Reactive vs Proactive
- **Goal:** Identify the optimal VF operating point
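A reactive policy can be sketched as a threshold rule with hysteresis: step down one VF operating point when the temperature limit is reached, step back up once the chip has cooled. The VF table, the 5°C hysteresis, and the temperature trace are hypothetical; only the general reactive scheme is illustrated, not a specific governor.

```python
# Hypothetical (voltage in V, frequency in MHz) operating points, slowest first.
VF_LEVELS = [(0.9, 1000), (1.0, 1400), (1.1, 1800), (1.2, 2000)]


def reactive_step(level, temp_c, t_max=90.0, hysteresis=5.0):
    """Return the next VF level index given the current temperature.

    Step down when temp_c reaches t_max, step up when it falls below
    t_max - hysteresis, otherwise hold the current level.
    """
    if temp_c >= t_max and level > 0:
        return level - 1
    if temp_c < t_max - hysteresis and level < len(VF_LEVELS) - 1:
        return level + 1
    return level


# Hypothetical temperature samples, one per control period.
temperatures = [70.0, 85.0, 91.0, 93.0, 88.0, 80.0]
level = len(VF_LEVELS) - 1  # start at peak performance
trace = []
for temp in temperatures:
    level = reactive_step(level, temp)
    trace.append(level)
```

A proactive policy would instead pick the level ahead of time from a prediction of the workload's thermal behavior, avoiding the oscillation visible in the trace.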

#### What about ConvNets?


# Voltage-Scaled ConvNets on ARM Cortex-A15



#### 1. Quantify thermal headroom

| ConvNet      | $N_{safe}$ | $t_{safe}$ (s) |
|--------------|------------|----------------|
| MobileNet v1 | 39         | 1.26           |
| MobileNet v2 | 42         | 1.27           |
| Inception v1 | 25         | 2.21           |
| Inception v4 | 4          | 2.93           |

Constraint: T < 90°C
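Assuming $N_{safe}$ is the number of consecutive inferences completed before the 90°C limit is reached and $t_{safe}$ is the corresponding elapsed time (an interpretation, since the slide does not define the symbols), the average time per inference inside the safe window follows by division.

```python
# (N_safe, t_safe in seconds) as reported in the table.
thermal_headroom = {
    "MobileNet v1": (39, 1.26),
    "MobileNet v2": (42, 1.27),
    "Inception v1": (25, 2.21),
    "Inception v4": (4, 2.93),
}

# Average latency per inference in milliseconds, under the assumption above.
avg_latency_ms = {
    name: round(1000.0 * t_safe / n_safe, 1)
    for name, (n_safe, t_safe) in thermal_headroom.items()
}
```

Under this reading the lightweight MobileNets run dozens of inferences at roughly 30 ms each before throttling, whereas Inception v4 completes only four.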

#### 2. Assess latency under thermal-aware DVFS

#### 3. Demonstrate that thermal profiles depend on topology




#### **Beyond DVFS: FINE-VH**





FU = Functional Unit

#### **FINE-VH outperforms DVFS**

- Limited area overhead: 6% w.r.t. standard flow
- Power savings from 32.0% to 38.2% w.r.t. ideal-DVFS



#### 15-30 rows!


- × Power Distribution
- × Layout Fragmentation
- × No Level-shifters

- <u>How:</u> Fully automated design and simulation flow integrated on a standard EDA tool
- Validated on:
  - RISC Core
  - Deep Learning Accelerator

# Wrap-up



#### 1. MEMORY



Prune and Quantize



Memory vs. Accuracy design-space exploration

3\$ HW is enough: 3x compression with <1% loss



Encoding-Aware Sparse Training



Maximize compression of encoding schemes

+8.73% accuracy at 12KB





Online Precision Scaling



Up to 35.2% savings with <8% loss



#### **3. POWER**



Voltage-Scaled ConvNets



Performance profiling under thermal-aware DVFS

Look at Temperature! Safe latency: 1-3s



Scalable-Effort ConvNets



Dynamic Energy-Accuracy-Abstraction Scaling

40% more accurate or 60% more efficient





Novel power distribution scheme to improve DVFS

Up to 38.2% power savings



## **The Lesson Learnt**

#### The definitive solution does not exist!


**Present: Exploratory Data Analysis** 

- Data Collection/Cleaning
- Data Visualization
- Assess different hypotheses:
  - Hyper-Param. Optimization
  - Learning Strategy
    - Supervised, Self-Supervised, Transfer Learning etc.




#### **Future: Exploratory Optimization Analysis**

- Design-Space Exploration
  - Accuracy, Memory, Energy, Power...
- Cost Analysis
  - Which HW?
- Assess different hypotheses:
  - NAS
  - Pruning
  - Quantization
  - Static vs. Dynamic...

#### **Research Activities**



#### **Technical Speaker at:**

- 2 international conferences (ICCAD18 and SNAMS19)
- 1 national workshop (IWES18)




Live Demonstrations at 2 international conferences (DATE19 and ISLPED19)



SENSEI - Sensemaking for Scalable IoT Platforms with In-Situ Data-Analytics: A Software-to-Silicon Solution for Energy-Efficient Machine-Learning on Chip (2 years)

# Thank you

### **Question Time**

